pyramidal topology
Global Convergence of Deep Networks with One Wide Layer Followed by Pyramidal Topology
Recent works have shown that gradient descent can find a global minimum for over-parameterized neural networks where the widths of all the hidden layers scale polynomially with N (N being the number of training samples). In this paper, we prove that, for deep networks, a single layer of width N following the input layer suffices to ensure a similar guarantee. In particular, all the remaining layers are allowed to have constant widths, and form a pyramidal topology. We show an application of our result to the widely used Xavier's initialization and obtain an over-parameterization requirement for the single wide layer of order N^2.
Global Convergence of Deep Networks with One Wide Layer Followed by Pyramidal Topology
Recent works have shown that gradient descent can find a global minimum for over-parameterized neural networks where the widths of all the hidden layers scale polynomially with N (N being the number of training samples). In this paper, we prove that, for deep networks, a single layer of width N following the input layer suffices to ensure a similar guarantee. In particular, all the remaining layers are allowed to have constant widths, and form a pyramidal topology. We show an application of our result to the widely used Xavier's initialization and obtain an over-parameterization requirement for the single wide layer of order N 2.
Review for NeurIPS paper: Global Convergence of Deep Networks with One Wide Layer Followed by Pyramidal Topology
This paper shows global convergence of gradient descent for deep neural networks that has wide first layer followed by pyramidal shape layers. It shows that an unconventional initialization with width N (data size) of the first layer suffices to show global convergence, which is much smaller than the required width for usual Xavier initialization. The presented result improves existing results greatly; the global convergence for width N is a great improvement from existing results. That is a valuable result. It is encouraged to add more detailed discussions about connection to existing NTK theories and possibilities of relaxing the assumptions maid in the analysis.
Global Convergence of Deep Networks with One Wide Layer Followed by Pyramidal Topology
Recent works have shown that gradient descent can find a global minimum for over-parameterized neural networks where the widths of all the hidden layers scale polynomially with N (N being the number of training samples). In this paper, we prove that, for deep networks, a single layer of width N following the input layer suffices to ensure a similar guarantee. In particular, all the remaining layers are allowed to have constant widths, and form a pyramidal topology. We show an application of our result to the widely used Xavier's initialization and obtain an over-parameterization requirement for the single wide layer of order N 2.
Global Convergence of Deep Networks with One Wide Layer Followed by Pyramidal Topology
Nguyen, Quynh, Mondelli, Marco
A recent line of research has provided convergence guarantees for gradient descent algorithms in the excessive over-parameterization regime where the widths of all the hidden layers are required to be polynomially large in the number of training samples. However, the widths of practical deep networks are often only large in the first layer(s) and then start to decrease towards the output layer. This raises an interesting open question whether similar results also hold under this empirically relevant setting. Existing theoretical insights suggest that the loss surface of this class of networks is well-behaved, but these results usually do not provide direct algorithmic guarantees for optimization. In this paper, we close the gap by showing that one wide layer followed by pyramidal deep network topology suffices for gradient descent to find a global minimum with a geometric rate. Our proof is based on a weak form of Polyak-Lojasiewicz inequality which holds for deep pyramidal networks in the manifold of full-rank weight matrices.